Updated “6/26/2020”

Why visualize data?

  1. To explore relationships in our data
  2. To communicate/present what we found

Why ggplot?

  • Easy to quickly generate plots, which helps us do #1
  • Flexible, with good defaults, which helps us with #2

Packages we need

# install.packages("ggplot2")
# install.packages("tidyverse")
remotes::install_github("allisonhorst/palmerpenguins")
## Skipping install of 'palmerpenguins' from a github remote, the SHA1 (63696526) has not changed since last install.
##   Use `force = TRUE` to force installation
library(ggplot2)
library(palmerpenguins)

The basics of the ggplot = the dataframe

Note: I originally used the iris dataset, published by Ronald Fisher in the Annals of Eugenics in 1936. While the data themselves pertain to flowers, we cannot strip these data of their context. Given the joyous alternative of penguins and many others (use data() to check out all the other available datasets that are not iris), switching is an easy thing to do.

data()

The basics of the ggplot = the dataframe

  • A dataframe where each row corresponds to a record (‘wide format’) and variables are columns
ggplot(data = penguins)

This creates a blank plot. We still need to decide which variables we want to plot.

The basics of the ggplot = the aes

  • aes = the aesthetics.
head(penguins)
## # A tibble: 6 x 7
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male

Let’s look at the relationship between the variables bill_length_mm and bill_depth_mm.

The basics of the ggplot = the aes

  • aes = the aesthetics (what we want to plot)
## Let's try the bill_length_mm vs. the bill_depth_mm (x, y plot)
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) 

Still blank! We need to decide what kind of plot to use.

How do we figure out what kind of plot to use?

We need to think about how best to visualize this data.

summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex     
##  Min.   :172.0     Min.   :2700   female:165  
##  1st Qu.:190.0     1st Qu.:3550   male  :168  
##  Median :197.0     Median :4050   NA's  : 11  
##  Mean   :200.9     Mean   :4202               
##  3rd Qu.:213.0     3rd Qu.:4750               
##  Max.   :231.0     Max.   :6300               
##  NA's   :2         NA's   :2

How do we figure out what kind of plot to use?

We need to think about what we want to show.

Quick exercise: try drawing with pen and paper what you think these plots might look like:

  • Relationship between bill_length_mm and bill_depth_mm
  • Distribution of bill_length_mm
  • Comparing the means of bill_length_mm by species

Hint: here are the options: histogram, XY scatter plot, boxplot.

The basics of ggplot = the geom

  • Geom = geometry

  • The ‘geom_’ functions choose what kind of graph we want to show:

    • geom_point (point graphs, xy relationships)
    • geom_histogram (distribution of a continuous variable)
    • geom_boxplot (comparing groups by their median and spread)
    • geom_col/geom_bar
    • geom_line (timeseries)
    • geom_tile (heatmaps)
    • geom_polygon/geom_sf() (mapping)
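
The list above names geom_col and geom_bar without describing them. As a minimal sketch of the difference (the object names species_counts, p_bar, and p_col are illustrative and not from the original slides): geom_bar counts the rows in each group for us, while geom_col plots values we have already computed.

```r
library(ggplot2)
library(palmerpenguins)

# geom_bar(): give it only x; it counts the rows per species itself
p_bar <- ggplot(data = penguins, aes(x = species)) +
  geom_bar()

# geom_col(): we supply the bar heights, so count the rows first
species_counts <- as.data.frame(table(species = penguins$species))
p_col <- ggplot(data = species_counts, aes(x = species, y = Freq)) +
  geom_col()

p_bar
p_col
```

Both plots should show the same bars; the choice depends on whether the data are raw records or already summarized.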

geom_point: XY Relationship

## add the geom that we want
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

Is there anything you notice?

Adding other aesthetics to help visualize potential relationships

Using color, we want to differentiate the points by species.

## map color to the species column inside aes()
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point()

Adding other aesthetics to help visualize potential relationships

  • The common aesthetics you can add:
    • color
    • fill
    • shape
    • linewidth
    • alpha (transparency)

Adding other aesthetics to help visualize potential relationships

Now try it with shape and fill.

## map shape to species instead of color
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm, shape = species)) +
  geom_point()

You may need to troubleshoot a bit to see which option works best for a given geom.
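
One reason fill can appear to do nothing with geom_point: the default point shape is drawn with color only. Point shapes 21 to 25 have both an outline (color) and an interior (fill). A small sketch, where the shape number, point size, and the name p_fill are my own choices rather than anything from the original slides:

```r
library(ggplot2)
library(palmerpenguins)

# shape 21 is a circle with a separate outline and interior,
# so mapping fill to species changes the interior color
p_fill <- ggplot(data = penguins,
                 aes(x = bill_length_mm, y = bill_depth_mm, fill = species)) +
  geom_point(shape = 21, color = "black", size = 2)

p_fill
```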

Specifying things within vs. outside the aes

  • Outside aes
## try it outside
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm), color = species) +
  geom_point()

Specifying things within vs. outside the aes

  • As part of geom
## add it to the geom 
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(color = "red")

Specifying things within vs. outside the aes

  • Inside aes = the way to specify an aesthetic by a variable in the dataframe.
## inside aes(): color is mapped to the species column
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point()

Try playing around with the aesthetics and see what you get!
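
To make the inside/outside distinction concrete, here is a hedged sketch that mixes both in one plot and then shows a common pitfall. The names p_mixed and p_pitfall and the alpha and size values are illustrative assumptions.

```r
library(ggplot2)
library(palmerpenguins)

# mapped inside aes(): color varies with the species column
# set outside aes(): alpha and size are the same for every point
p_mixed <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species), alpha = 0.6, size = 2)

# pitfall: inside aes(), "red" is treated as data (a one-level category),
# so the points get a default palette color and a legend entry called "red"
p_pitfall <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = "red"))

p_mixed
p_pitfall
```

A rule of thumb: if the value is a column name, it goes inside aes(); if it is a fixed value such as "red" or 0.6, it goes outside.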

Going a bit further: comparing means

## boxplot of bill depth for each species
ggplot(data = penguins, aes(x = species, y = bill_depth_mm)) +
  geom_boxplot()

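  • Boxplots show the median as the middle line, not the mean. In ggplot, the box spans the 25% and 75% quantiles, the whiskers extend up to 1.5 times the interquartile range beyond the box, and any points past the whiskers are drawn individually as outliers.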
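
Because the line inside a boxplot marks the median, the mean has to be added separately if we want to see it. One possible way, sketched here with stat_summary (the marker shape, size, color, and the name p_box_mean are my own choices):

```r
library(ggplot2)
library(palmerpenguins)

# boxplot of bill depth by species, with the group mean overlaid
# as a red diamond (shape 18); stat_summary() computes the mean per group
p_box_mean <- ggplot(data = penguins, aes(x = species, y = bill_depth_mm)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 4, color = "red")

p_box_mean
```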

Going a bit further: distributions

## The histogram
ggplot(data = penguins, aes(x = bill_depth_mm)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Play around with the binwidth option to see which bin width shows the distribution best.
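
As a starting point for that experiment, a short sketch of the two ways to control the bins. The specific values (0.5 mm, 15 bins) and the names p_width and p_bins are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# binwidth is in the units of x (mm here): each bar spans 0.5 mm
p_width <- ggplot(data = penguins, aes(x = bill_depth_mm)) +
  geom_histogram(binwidth = 0.5)

# alternatively, fix the number of bars and let ggplot work out the width
p_bins <- ggplot(data = penguins, aes(x = bill_depth_mm)) +
  geom_histogram(bins = 15)

p_width
p_bins
```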

Going a bit further: distributions by categorical variable

## The histogram
ggplot(data = penguins, aes(x = bill_depth_mm, fill = species)) +
    geom_histogram(binwidth = 0.25)

Going a bit further: distributions by categorical variable

## The density plot
ggplot(data = penguins, aes(x = bill_depth_mm, fill = species)) +
  geom_density()
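
With filled densities, the curves drawn last can hide the ones behind them. A common remedy, sketched here, is to make the fill partly transparent with alpha; the value 0.5 and the name p_density are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# alpha is set outside aes(), so every density gets the same transparency
p_density <- ggplot(data = penguins, aes(x = bill_depth_mm, fill = species)) +
  geom_density(alpha = 0.5)

p_density
```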

Going a bit further: creating multiple plots with facetting

  • We may want to split this plot by species to make the distributions easier to see. For comparison, here is the version that uses fill:
## By fill
ggplot(data = penguins, aes(x = bill_depth_mm, fill = species)) +
  geom_histogram(binwidth = 0.25)

Going a bit further: creating multiple plots with facetting

## By facetting
ggplot(data = penguins, aes(x = bill_depth_mm)) +
  geom_histogram(binwidth = 0.25) +
  facet_wrap(~ species)
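
When the goal is to compare where each distribution sits along the x axis, stacking the panels in a single column can help, because the panels then share one aligned x axis. A sketch under that assumption (ncol = 1 and the name p_stacked are my additions):

```r
library(ggplot2)
library(palmerpenguins)

# one panel per species, stacked vertically so the x axes line up
p_stacked <- ggplot(data = penguins, aes(x = bill_depth_mm)) +
  geom_histogram(binwidth = 0.25) +
  facet_wrap(~ species, ncol = 1)

p_stacked
```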

Customizing our ggplot

  • We’ll save our ggplot as an object called base.
base <- ggplot(data = penguins, aes(x = species, y = bill_depth_mm, color = species)) +
  geom_boxplot()

Customizing: labels

  • Then we can change specific components of it by adding customizations.
base +
  xlab("Species") +
  ylab("Bill depth")
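
An alternative to separate xlab and ylab calls is labs, which sets several labels at once. In this sketch the title text and the "(mm)" unit in the axis label are my own additions for illustration; base is rebuilt so the block runs on its own.

```r
library(ggplot2)
library(palmerpenguins)

# rebuild the base plot so this example is self-contained
base <- ggplot(data = penguins,
               aes(x = species, y = bill_depth_mm, color = species)) +
  geom_boxplot()

# labs() sets axis labels and the title in a single call
p_labeled <- base +
  labs(x = "Species",
       y = "Bill depth (mm)",
       title = "Bill depth by penguin species")

p_labeled
```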

Customizing: color scales

base +
  xlab("Species") +
  ylab("Bill depth") +
  scale_color_manual(values = c("purple", "blue", "lightblue"), name = "Species of Penguin")

Customizing: color scales

  • The scale_ functions help us specify what colors we want to use
    • scale_fill_gradient
    • scale_fill_continuous
  • The viridis color scales are a good option to try for continuous variables! (scale_fill_viridis_c in ggplot2, or scale_fill_viridis from the viridis package)
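
As a sketch of a viridis scale on a continuous variable: the scatter plot below colors points by body mass. It uses scale_color_viridis_c, the color counterpart of the fill scale that ships with ggplot2; the legend title and the name p_viridis are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# body_mass_g is continuous, so a continuous (_c) viridis scale applies;
# geom_point uses the color aesthetic, hence scale_color_* rather than scale_fill_*
p_viridis <- ggplot(data = penguins,
                    aes(x = bill_length_mm, y = bill_depth_mm, color = body_mass_g)) +
  geom_point() +
  scale_color_viridis_c(name = "Body mass (g)")

p_viridis
```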

Customizing: themes

base +
  theme_bw()

Try theme_dark and theme_classic.
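
Complete themes can be combined with individual tweaks through theme. A minimal sketch, assuming we want larger text and the legend below the plot (both choices, and the name p_themed, are mine); base is rebuilt so the block runs on its own.

```r
library(ggplot2)
library(palmerpenguins)

# rebuild the base plot so this example is self-contained
base <- ggplot(data = penguins,
               aes(x = species, y = bill_depth_mm, color = species)) +
  geom_boxplot()

# start from a complete theme, then override one element;
# theme() must come after theme_classic(), or it would be overwritten
p_themed <- base +
  theme_classic(base_size = 14) +
  theme(legend.position = "bottom")

p_themed
```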

Saving our plots

base <-  ggplot(data = penguins, aes(x = species, y = bill_depth_mm, color = species)) +
  geom_boxplot() + 
  xlab("Species") +
  ylab("Bill depth") +
  scale_color_manual(values = c("purple", "blue", "lightblue"), name = "Species of Penguin") +
  theme_bw()

ggsave("base_plot.pdf", base, device = "pdf")
## Saving 7.5 x 4.5 in image
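
If no size is given, ggsave uses the size of the current graphics device, which is why the message above reports 7.5 x 4.5 in. For reproducible figures it may help to set the size explicitly. A sketch, where the PNG format, the dimensions, the resolution, and the temporary output path are all assumptions for illustration:

```r
library(ggplot2)
library(palmerpenguins)

# rebuild the plot so this example is self-contained
base <- ggplot(data = penguins,
               aes(x = species, y = bill_depth_mm, color = species)) +
  geom_boxplot() +
  theme_bw()

# write to a temporary folder; replace with your own path in practice
out_file <- file.path(tempdir(), "base_plot.png")

# the file extension picks the format; width and height are in inches here
ggsave(out_file, plot = base, width = 6, height = 4, units = "in", dpi = 300)
```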

Customizing…the options are almost unlimited!

  • For most of the routine things you need to do, there is a way. To figure it out, your best friends are Stack Exchange and the ggplot2 documentation!

Other considerations: communicating accurately and effectively

  • Anscombe’s quartet
    • Four datasets with nearly identical means, variances, and correlations; only when you plot them do you see the differences.
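
The quartet ships with R as the built-in anscombe data frame, so the claim can be checked directly. The sketch below computes the summaries and then plots the four sets; the helper names quartet_summary and quartet_long are illustrative.

```r
library(ggplot2)

# anscombe is a built-in data frame with columns x1..x4 and y1..y4
quartet_summary <- data.frame(
  set    = 1:4,
  mean_x = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  var_y  = sapply(1:4, function(i) var(anscombe[[paste0("y", i)]])),
  cor_xy = sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                       anscombe[[paste0("y", i)]]))
)
round(quartet_summary, 2)

# reshape to long format (one row per point) so we can facet by set
quartet_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(set = paste("Set", i),
             x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
}))

# the summaries barely differ, but the four panels look nothing alike
ggplot(quartet_long, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set)
```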

Other considerations: communicating accurately and effectively

  • Scales of axes: should they include zero?
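
To see what is at stake, it can help to draw the same plot with and without zero on the y axis. In this sketch expand_limits forces the axis to include zero; the names p_default and p_zero are illustrative, and which version is more honest depends on the question being asked.

```r
library(ggplot2)
library(palmerpenguins)

# default: the y axis is fitted to the range of the data,
# which makes differences between species look large
p_default <- ggplot(data = penguins, aes(x = species, y = bill_depth_mm)) +
  geom_boxplot()

# same plot, but the y axis is extended to include zero,
# which shows the differences relative to the overall bill depth
p_zero <- p_default +
  expand_limits(y = 0)

p_default
p_zero
```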

Exercises

  • Using the RxP data, plot:
    • The distribution of Snout-vent length at emergence (SVL.initial)
    • The correlation between Snout-vent length at emergence and at the end of the experiment (SVL.initial and SVL.final)
    • The mean of SVL.final by Res
    • The mean of SVL.final by Pred
    • Advanced: the mean difference between final and initial SVL for Res and Pred

How would you plot your own data?

More resources

References for images